This code chunk loads all required R packages and sets a global chunk option for reproducibility. The tidyverse (including dplyr) handles data cleaning, reshaping, and merging, which is essential given the integration of eviction records, economic indicators, and demographic data. Spatial work relies on sf and tigris to manipulate census geographies and enable tract-level analysis, while tidycensus supports the integration of standardized Census attributes where applicable. Visualization and diagnostics use ggplot2, patchwork, and scales for consistent exploratory data analysis and error assessment. Finally, broom provides tidy model output for interpretation, and readxl imports Excel-formatted documents that could not be reliably converted to CSV.
This code chunk imports all primary datasets used in the analysis, bringing together eviction records, demographic context, economic indicators, and measures of household financial vulnerability. The core eviction data are sourced from evictionlab.org. Multiple eviction tables are loaded to support different analytical needs, including raw filing counts (tract_filings), normalized filing rates (tract_filing_rate), and citywide temporal patterns relative to historical baselines (eviction_trends). These complementary formats allow the project to examine evictions at both the spatial (tract-level) and temporal (month-to-month) scales, which is critical for understanding changes in eviction activity.
To contextualize eviction behavior within broader economic and social conditions, the project incorporates several external data sources. Monthly unemployment rates and Consumer Price Index (CPI) data capture macroeconomic conditions that directly affect households’ ability to meet rent obligations. These data are sourced from the Bureau of Labor Statistics. The Urban Institute financial health dataset provides indicators of underlying financial vulnerability, such as debt burdens, housing cost stress, and labor market engagement. Additional datasets from the Eviction Lab describing neighborhood demographics, eviction hotspots, and deviations from historical filing baselines support exploratory analysis and feature engineering during later modeling stages. These datasets enable a transparent, multi-source approach to integrating diverse data in support of a policy-relevant predictive model.
# Changes in Claim Amount
claim_change <- read.csv("eviction_data/changes_claims.csv")
# Demographics
demographics <- read.csv("eviction_data/demographics.csv")
# Hotspots
hotspots <- read.csv("eviction_data/eviction_hotspots.csv")
# Evictions Relative to Baseline
evictions_baseline <- read.csv("eviction_data/filings_relative_baseline.csv")
# Tract Level Filings
tract_filings <- read.csv("eviction_data/tract_level_filings.csv")
# Tract Level Filing Rate
tract_filing_rate <- read.csv("eviction_data/tract_level_filing_rate.csv")
# Eviction Trends
eviction_trends <- read.csv("eviction_data/trends_eviction.csv")
# Monthly Unemployment Rates
unemployment_monthly <- read_xlsx("eviction_data/unemployment_rate_montly.xlsx")
## New names: `` -> `...2` ... `` -> `...13` (unnamed columns auto-named)
# Consumer Price Index
consumer_price_index <- read_xlsx("eviction_data/consumer_price_index_monthly.xlsx")
## New names: `` -> `...2` ... `` -> `...16` (unnamed columns auto-named)
# Financial Health Data
financial_health <- read_xlsx("eviction_data/financial-health-of-residents-data.xlsx")
## New names: `` -> `...2` ... `` -> `...18` (unnamed columns auto-named)
To begin, we examine the tract-level filing data.
# Investigate the data set
head(tract_filings, 50)
## id month_date month_filings month_rate month_rate_scale_max
## 1 4.2101e+10 01/11/2024 11 0.012835473 0.1248843
## 2 4.2101e+10 01/11/2024 8 0.006950478 0.1248843
## 3 4.2101e+10 01/11/2024 5 0.008503401 0.1248843
## 4 4.2101e+10 01/11/2024 51 0.030520646 0.1248843
## 5 4.2101e+10 01/11/2024 9 0.007594937 0.1248843
## 6 4.2101e+10 01/11/2024 3 0.013761468 0.1248843
## ... (rows 7-50 omitted for brevity)
## month_diff pct_white pct_black pct_hispanic
## 1 0.6315789 0.7502917 0.03733956 0.075845974
## 2 0.6428571 0.7801911 0.07819288 0.078192876
## 3 1.0000000 0.3163265 0.00000000 0.224489796
## 4 1.2380952 0.5188510 0.04727708 0.052663076
## 5 0.4444444 0.5021097 0.25485232 0.033755274
## 6 1.0000000 0.9128440 0.00000000 0.000000000
## ... (rows 7-50 omitted for brevity)
Next, we import census tract geometry to join with and visualize the tract filing data.
# Import Census Tract Geometry
options(tigris_use_cache = TRUE)
# Import for Philadelphia
philadelphia_tracts <- tracts(state = "PA", county = "Philadelphia", year = 2022)
# Look at Results
plot(philadelphia_tracts$geometry)
# Join Tracts to Filings
# Make Sure Format Matches
philadelphia_tracts$GEOID <- as.character(philadelphia_tracts$GEOID)
tract_filings$id <- as.character(tract_filings$id)
# Join
evictions_geo <- philadelphia_tracts %>%
left_join(tract_filings, by = c("GEOID" = "id"))
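One caveat worth flagging: read.csv parses the 11-digit GEOIDs in tract_filings$id as doubles (hence the 4.2101e+10 display above). Converting back with as.character works here because Pennsylvania GEOIDs have no leading zeros, but reading the column as character up front is safer in general. A minimal sketch, using an illustrative inline CSV rather than the project files:

```r
# Reading the id column as character preserves leading zeros that
# numeric parsing would silently drop (relevant for state FIPS 01-09)
df <- read.csv(text = "id,month_filings\n04210100100,11",
               colClasses = c(id = "character"))
df$id  # "04210100100" -- the leading zero survives
```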
Here we map the monthly filing rate by census tract.
ggplot(evictions_geo) +
geom_sf(aes(fill = month_rate), color = NA) +
scale_fill_viridis_c(option = "magma") +
labs(
title = "Eviction Rate by Census Tract",
fill = "Eviction Rate"
) +
theme_minimal()
# Hotspots
library(sf)
library(ggplot2)
library(readr)
library(dplyr)
# 1. Read CSV
hotspots <- read_csv("hotspots/philadelphia_hotspots_media_report.csv")
## Rows: 75 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): time_period, end_date, xplaintiff, xstreet_clean
## dbl (5): position, filings, top100, lat, lon
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# 2. Convert to sf object
hotspots_sf <- st_as_sf(
hotspots,
coords = c("lon", "lat"),
crs = 4326
)
# 1. Read tract boundary
tracts <- st_read("border/Census_Blocks_2010.geojson")
## Reading layer `Census_Blocks_2010' from data source
## `E:\Upenn course\policy\portfolio-setup-YanyangChen-penn\assignments\final_project_analysis_process\border\Census_Blocks_2010.geojson'
## using driver `GeoJSON'
## Simple feature collection with 408 features and 27 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -75.28027 ymin: 39.86705 xmax: -74.95578 ymax: 40.13799
## Geodetic CRS: NAD83
# 2. Reproject hotspots to match tracts CRS
hotspots_sf <- st_transform(hotspots_sf, st_crs(tracts))
# 3. Map tracts + hotspots
ggplot() +
geom_sf(data = tracts, fill = "grey95", color = "white") +
geom_sf(
data = hotspots_sf,
aes(size = filings),
alpha = 0.7
) +
scale_size_continuous(name = "Eviction filings") +
labs(
title = "Eviction Hotspots over Census Tracts",
subtitle = "Top 100 buildings in Philadelphia",
caption = "Source: Eviction Tracking System, ACS, Census tracts"
) +
theme_minimal()
Eviction filings exhibit marked geographic concentration in
Philadelphia. North Philadelphia (40.05°N-40.15°N) shows the densest
clustering with 100+ filings, reflecting historically disinvested,
predominantly Black neighborhoods. West Philadelphia and South
Philadelphia display secondary hotspots (50-75 filings), while Northeast
suburbs show minimal activity. This stark contrast demonstrates that housing
instability is geographically concentrated among vulnerable populations.
For predictive modeling, the spatial clustering suggests spatial dependence
and racialized patterns reflecting structural inequality. The
geographic specificity enables targeted prevention interventions in
highest-need areas, though we acknowledge filings may reflect both
tenant vulnerability and landlord enforcement variation.
The dataset contains 72 monthly observations from January 2020 to December 2024, with eviction filing counts, historical averages, and deviation metrics.

Key findings:

Temporal trends: Eviction filings show a clear recovery pattern post-2020. 2020-2021 experienced dramatic suppression (COVID-19 moratoriums), with filings dropping to near zero in April-June 2020. Recovery accelerated from 2022 onwards, stabilizing around 1,000-1,200 filings monthly by 2023-2024.

Seasonality: A weak seasonal pattern exists: January and August show slightly elevated filings (1.55× of average each), while April-May show lower activity (0.20× and 0.16× of average in the 2020-2021 data). This reflects policy implementation timing.

Year-over-year comparison: The 2024 average (1,061 filings) closely mirrors 2023 (1,059 filings), indicating market stabilization post-recovery. Preliminary 2025 data suggest continued stability with minor volatility.

Volatility: The coefficient of variation decreased over time, showing market normalization after pandemic disruptions. Recent years exhibit more predictable patterns than the 2020-2021 volatility.

Implications: The data support a stable eviction environment with predictable monthly volumes suitable for forecasting models.
Philadelphia Census Tracts: Geospatial Analysis

These choropleth maps visualize key socioeconomic indicators across Philadelphia’s census tracts, revealing significant spatial disparities in poverty rates, unemployment, rent burden, and demographic composition. The darker regions indicate higher concentrations of economic vulnerability, with notable clustering in North and West Philadelphia tracts showing elevated poverty, unemployment, and rental cost burdens exceeding 30% of household income. Demographic analysis reveals stark residential segregation patterns, with distinct geographic concentrations of racial and ethnic populations. These spatial patterns align with historical inequities in housing access, economic opportunity, and community investment, providing critical context for understanding eviction vulnerability and housing instability across the city.
The dependent variable in this analysis is percentage_diff, which measures the monthly deviation of eviction filings from a long-term baseline. Rather than modeling raw eviction counts, this transformation captures relative change in eviction activity, allowing the analysis to focus on fluctuations around typical levels. This is particularly important in Philadelphia and other large cities, where eviction volumes vary seasonally and were disrupted by major shocks such as the COVID-19 pandemic. By using a normalized measure, the model isolates meaningful economic variation instead of conflating structural patterns with short-term dynamics.
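Concretely, the eviction_trends output printed later in the document implies that percentage_diff is the ratio of observed filings to the baseline average (e.g., January 2020: 2,015 filings against an average of 1,294.5), so values above 1 indicate above-baseline activity and 0 indicates a full shutdown. A quick arithmetic check under that reading:

```r
# Assumed definition, inferred from the printed eviction_trends rows
month_filings <- 2015
avg_filings <- 1294.5
percentage_diff <- month_filings / avg_filings
round(percentage_diff, 7)  # ~1.5565856, matching the printed value
```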
From an economic perspective, percentage_diff is better suited than raw counts because it reflects marginal responses to changing conditions, which is the core interest of applied economic analysis. Labor market shocks, inflation, and household financial stress do not directly determine how many eviction cases exist in absolute terms; instead, they influence whether eviction pressure rises or falls relative to normal conditions. Modeling deviations from a baseline aligns the dependent variable with economic theory on cyclical stress, making the results more applicable for policy analysis and more appropriate for regression-based inference.
This step restructures the economic and financial datasets to make them analytically compatible with the monthly eviction data. The Urban Institute financial health dataset comes in a report-style format rather than a machine-ready table, with multiple header and metadata rows embedded in the file. To correct this, we promote row 4 to the column headers, replace missing or blank column names with placeholders to allow selection and filtering, remove non-data rows, and isolate the row corresponding to Philadelphia. All remaining variables are then converted to numeric format, ensuring that these indicators (such as debt burden, housing cost stress, and labor force participation) can be used in regression-based modeling. These steps are necessary because model estimation and validation require numeric, consistently formatted inputs.
The Consumer Price Index and unemployment datasets require a similar cleaning approach because both come in wide, month-column formats with extensive metadata. We identify the true header rows, remove non-data rows, convert month columns to numeric values, and reshape each dataset into a long format with one row per year-month observation. These transformations matter because the eviction outcome is measured monthly, and predictive modeling requires all time-varying variables to align. A month lookup table is created to convert month names into numeric values, ensuring consistent joins across datasets. The cleaned CPI and unemployment data are then merged into the monthly eviction trend dataset using year and month keys.
Finally, the cleaned Urban Institute financial indicators are appended to the eviction trends dataset using a dummy join, reflecting the fact that these measures describe baseline household financial vulnerability rather than month-specific disruptions. Lagged versions of key eviction outcome variables are then constructed to capture persistence in eviction activity, a defining feature of eviction processes given legal delays and accumulated financial stress. These lag variables enable the model to distinguish between short-term economic fluctuations and ongoing eviction patterns.
Note: thus far the Urban Institute data has posed no issues; a later step will reveal why it is more difficult to work with in this analysis.
# Data Cleaning
# Clean Financial Health Data
financial_health_clean <- financial_health
# Use row 4 as header
names(financial_health_clean) <- as.character(unlist(financial_health_clean[4, ]))
## Warning: The `value` argument of `names<-()` can't be empty as of tibble 3.0.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Replace blank/NA column names with placeholders
bad_name_index <- which(is.na(names(financial_health_clean)) | names(financial_health_clean) == "")
if (length(bad_name_index) > 0) {
names(financial_health_clean)[bad_name_index] <- paste0("col_", seq_along(bad_name_index))
}
# Remove metadata rows 1–4, keep Philadelphia only
financial_health_clean <- financial_health_clean[-(1:4), ] %>%
filter(City == "Philadelphia, PA")
# Convert all but City to numeric
financial_health_numeric_columns <- setdiff(names(financial_health_clean), "City")
financial_health_clean[financial_health_numeric_columns] <-
lapply(financial_health_clean[financial_health_numeric_columns], as.numeric)
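The promote-row-to-header pattern above can be exercised on a toy frame (the values and layout here are made up for illustration):

```r
# Toy report-style table: row 1 is metadata, row 2 holds the real headers
raw <- data.frame(V1 = c("Report title", "City", "Philadelphia, PA"),
                  V2 = c("", "Overall", "628"),
                  stringsAsFactors = FALSE)
names(raw) <- as.character(unlist(raw[2, ]))  # promote row 2 to headers
clean <- raw[-(1:2), , drop = FALSE]          # drop metadata and header rows
clean$Overall <- as.numeric(clean$Overall)    # restore numeric type
clean
```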
# Clean CPI Data
cpi_wide <- consumer_price_index
# Real header is row 11
names(cpi_wide) <- as.character(unlist(cpi_wide[11, ]))
cpi_wide <- cpi_wide[-(1:11), ]
cpi_wide$Year <- as.integer(as.numeric(cpi_wide$Year))
cpi_month_names <- c("Jan","Feb","Mar","Apr","May","Jun",
"Jul","Aug","Sep","Oct","Nov","Dec")
# Convert month columns to numeric
for (col in cpi_month_names) {
if (col %in% names(cpi_wide)) {
cpi_wide[[col]] <- as.numeric(cpi_wide[[col]])
}
}
# Month lookup table
month_lookup <- data.frame(
month_name = cpi_month_names,
month = 1:12
)
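As an aside, base R already ships the abbreviation vector month.abb, so the lookup table can be generated without hard-coding the names, or replaced entirely with match():

```r
# Equivalent lookup built from the built-in constant
month_lookup_alt <- data.frame(month_name = month.abb, month = 1:12)
match("Apr", month.abb)  # 4
```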
# Convert CPI to long format
cpi_long <- cpi_wide %>%
select(Year, all_of(cpi_month_names)) %>%
pivot_longer(cols = all_of(cpi_month_names),
names_to = "month_name",
values_to = "cpi") %>%
filter(!is.na(cpi)) %>%
left_join(month_lookup, by = "month_name") %>%
rename(year = Year)
# Clean Unemployment Data
unemployment_wide <- unemployment_monthly
# Row 10 is the real header
names(unemployment_wide) <- as.character(unlist(unemployment_wide[10, ]))
unemployment_wide <- unemployment_wide[-(1:10), ]
unemployment_wide$Year <- as.integer(as.numeric(unemployment_wide$Year))
unemployment_month_names <- c("Jan","Feb","Mar","Apr","May","Jun",
"Jul","Aug","Sep","Oct","Nov","Dec")
# Convert month columns to numeric
for (col in unemployment_month_names) {
if (col %in% names(unemployment_wide)) {
unemployment_wide[[col]] <- as.numeric(unemployment_wide[[col]])
}
}
# Convert unemployment data to long format
unemployment_long <- unemployment_wide %>%
select(Year, all_of(unemployment_month_names)) %>%
pivot_longer(cols = all_of(unemployment_month_names),
names_to = "month_name",
values_to = "unemployment_rate") %>%
filter(!is.na(unemployment_rate)) %>%
left_join(month_lookup, by = "month_name") %>%
rename(year = Year)
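For reference, the same wide-to-long pivot can be done without tidyr using base R's reshape(); a toy sketch with made-up rates:

```r
# One wide row with month columns, as in the BLS exports
wide <- data.frame(Year = 2020, Jan = 3.4, Feb = 3.5)
long <- reshape(wide, direction = "long",
                varying = c("Jan", "Feb"), v.names = "rate",
                times = c("Jan", "Feb"), timevar = "month_name")
long[, c("Year", "month_name", "rate")]  # one row per year-month
```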
# Clean Monthly Eviction Trend Data
# Parse month and year
eviction_trends_clean <- eviction_trends %>%
mutate(
month_num = as.integer(substr(month, 1, 2)),
year = as.integer(substr(month, 4, 7)),
month_date = paste0(year, "-", month_num, "-01")
)
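One small caveat: the paste0() call yields dates like "2020-1-01" without zero padding. as.Date() still parses them, but sprintf() produces proper ISO 8601 strings that also sort correctly as text; a possible tweak:

```r
# Zero-padded alternative to paste0(year, "-", month_num, "-01")
year <- 2020L; month_num <- 3L
month_date <- sprintf("%04d-%02d-01", year, month_num)
month_date  # "2020-03-01"
```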
# Merge with unemployment and CPI
eviction_trends_merged <- eviction_trends_clean %>%
left_join(unemployment_long,
by = c("year" = "year", "month_num" = "month")) %>%
left_join(cpi_long,
by = c("year" = "year", "month_num" = "month"))
# Attach Urban Institute financial variables
eviction_trends_merged <- eviction_trends_merged %>%
mutate(dummy = 1) %>%
left_join(financial_health_clean %>% mutate(dummy = 1), by = "dummy") %>%
select(-dummy)
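The dummy-column trick above is a standard cross join; in base R, merge() with by = NULL performs the same Cartesian product directly, shown here on toy frames:

```r
monthly <- data.frame(year = 2020, month_num = 1:3)
static  <- data.frame(gini = 0.505)           # one row of time-invariant context
crossed <- merge(monthly, static, by = NULL)  # every month gets the static row
nrow(crossed)  # 3
```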
# Add lag variables
eviction_trends_final <- eviction_trends_merged %>%
arrange(year, month_num) %>%
mutate(
l1_filings = lag(month_filings, 1),
l1_percentage_diff = lag(percentage_diff, 1)
)
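dplyr::lag() shifts each series down one row, leaving NA in the first month. The same effect in base R, via a hypothetical helper written only for illustration:

```r
lag1 <- function(x) c(NA, head(x, -1))  # shift forward by one observation
filings <- c(2015, 1720, 813)
lag1(filings)  # NA 2015 1720
```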
Here we check that the new datasets are structured as intended.
# Check Financial Health
cat("Financial Health Clean")
## Financial Health Clean
print(str(financial_health_clean))
## tibble [1 × 18] (S3: tbl_df/tbl/data.frame)
## $ City : chr "Philadelphia, PA"
## $ Overall : num 628
## $ White areas : num 703
## $ Nonwhite areas : num 571
## $ col_1 : num NA
## $ % with : num 0.466
## $ Median amount among those with : num 1476
## $ col_2 : num NA
## $ % with home foreclosure : num 0.00435
## $ % housing-cost burdened, low-income : num 0.68
## $ % unbanked, metro area : num 0.041
## $ % with health insurance coverage : num 0.857
## $ % received EITC, low-income : num 0.415
## $ Unemployment rate : num 0.109
## $ Labor force participation rate : num 0.604
## $ % below 200% of federal poverty level: num 0.46
## $ % population change, 2000–15 : num 0.0329
## $ Gini index of income inequality : num 0.505
## NULL
print(head(financial_health_clean))
## # A tibble: 1 × 18
## City Overall `White areas` `Nonwhite areas` col_1 `% with`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Philadelphia, PA 628 703 571 NA 0.466
## # ℹ 12 more variables: `Median amount among those with` <dbl>, col_2 <dbl>,
## # `% with home foreclosure` <dbl>,
## # `% housing-cost burdened, low-income` <dbl>,
## # `% unbanked, metro area` <dbl>, `% with health insurance coverage` <dbl>,
## # `% received EITC, low-income` <dbl>, `Unemployment rate` <dbl>,
## # `Labor force participation rate` <dbl>,
## # `% below 200% of federal poverty level` <dbl>, …
# Check CPI Long Format
cat("CPI Long Format")
## CPI Long Format
print(str(cpi_long))
## tibble [64 × 4] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:64] 2015 2015 2015 2015 2015 2015 2016 2016 2016 2016 ...
## $ month_name: chr [1:64] "Feb" "Apr" "Jun" "Aug" ...
## $ cpi : num [1:64] 242 244 246 245 244 ...
## $ month : int [1:64] 2 4 6 8 10 12 2 4 6 8 ...
## NULL
print(head(cpi_long))
## # A tibble: 6 × 4
## year month_name cpi month
## <int> <chr> <dbl> <int>
## 1 2015 Feb 242. 2
## 2 2015 Apr 244. 4
## 3 2015 Jun 246. 6
## 4 2015 Aug 245. 8
## 5 2015 Oct 244. 10
## 6 2015 Dec 242. 12
# Check Unemployment Long Format
cat("Unemployment Long Format")
## Unemployment Long Format
print(str(unemployment_long))
## tibble [128 × 4] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:128] 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ month_name : chr [1:128] "Jan" "Feb" "Mar" "Apr" ...
## $ unemployment_rate: num [1:128] 6.2 6 5.7 5.2 5.7 5.7 6.1 5.7 5.1 4.9 ...
## $ month : int [1:128] 1 2 3 4 5 6 7 8 9 10 ...
## NULL
print(head(unemployment_long))
## # A tibble: 6 × 4
## year month_name unemployment_rate month
## <int> <chr> <dbl> <int>
## 1 2015 Jan 6.2 1
## 2 2015 Feb 6 2
## 3 2015 Mar 5.7 3
## 4 2015 Apr 5.2 4
## 5 2015 May 5.7 5
## 6 2015 Jun 5.7 6
# Check Eviction Data
cat("Eviction Trends Final After Macro UI Merge & Lag")
## Eviction Trends Final After Macro UI Merge & Lag
print(str(eviction_trends_final))
## 'data.frame': 70 obs. of 32 variables:
## $ month : chr "01/2020" "02/2020" "03/2020" "04/2020" ...
## $ month_last_day : logi NA NA NA NA NA NA ...
## $ month_filings : int 2015 1720 813 0 0 0 198 176 701 792 ...
## $ avg_filings : num 1294 1194 1142 925 1065 ...
## $ percentage_diff : num 1.557 1.441 0.712 0 0 ...
## $ month_num : int 1 2 3 4 5 6 7 8 9 10 ...
## $ year : int 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ month_date : chr "2020-1-01" "2020-2-01" "2020-3-01" "2020-4-01" ...
## $ month_name.x : chr "Jan" "Feb" "Mar" "Apr" ...
## $ unemployment_rate : num 4.3 4.2 4.9 13.9 13.4 12.9 12 10.4 8.6 7.4 ...
## $ month_name.y : chr NA "Feb" NA "Apr" ...
## $ cpi : num NA 260 NA 256 NA ...
## $ City : chr "Philadelphia, PA" "Philadelphia, PA" "Philadelphia, PA" "Philadelphia, PA" ...
## $ Overall : num 628 628 628 628 628 628 628 628 628 628 ...
## $ White areas : num 703 703 703 703 703 703 703 703 703 703 ...
## $ Nonwhite areas : num 571 571 571 571 571 571 571 571 571 571 ...
## $ col_1 : num NA NA NA NA NA NA NA NA NA NA ...
## $ % with : num 0.466 0.466 0.466 0.466 0.466 ...
## $ Median amount among those with : num 1476 1476 1476 1476 1476 ...
## $ col_2 : num NA NA NA NA NA NA NA NA NA NA ...
## $ % with home foreclosure : num 0.00435 0.00435 0.00435 0.00435 0.00435 ...
## $ % housing-cost burdened, low-income : num 0.68 0.68 0.68 0.68 0.68 ...
## $ % unbanked, metro area : num 0.041 0.041 0.041 0.041 0.041 ...
## $ % with health insurance coverage : num 0.857 0.857 0.857 0.857 0.857 ...
## $ % received EITC, low-income : num 0.415 0.415 0.415 0.415 0.415 ...
## $ Unemployment rate : num 0.109 0.109 0.109 0.109 0.109 ...
## $ Labor force participation rate : num 0.604 0.604 0.604 0.604 0.604 ...
## $ % below 200% of federal poverty level: num 0.46 0.46 0.46 0.46 0.46 ...
## $ % population change, 2000–15 : num 0.0329 0.0329 0.0329 0.0329 0.0329 ...
## $ Gini index of income inequality : num 0.505 0.505 0.505 0.505 0.505 ...
## $ l1_filings : int NA 2015 1720 813 0 0 0 198 176 701 ...
## $ l1_percentage_diff : num NA 1.557 1.441 0.712 0 ...
## NULL
print(head(eviction_trends_final))
## month month_last_day month_filings avg_filings percentage_diff month_num
## 1 01/2020 NA 2015 1294.5 1.5565856 1
## 2 02/2020 NA 1720 1193.5 1.4411395 2
## 3 03/2020 NA 813 1142.5 0.7115974 3
## 4 04/2020 NA 0 925.0 0.0000000 4
## 5 05/2020 NA 0 1065.0 0.0000000 5
## 6 06/2020 NA 0 1032.0 0.0000000 6
## year month_date month_name.x unemployment_rate month_name.y cpi
## 1 2020 2020-1-01 Jan 4.3 <NA> NA
## 2 2020 2020-2-01 Feb 4.2 Feb 259.694
## 3 2020 2020-3-01 Mar 4.9 <NA> NA
## 4 2020 2020-4-01 Apr 13.9 Apr 256.353
## 5 2020 2020-5-01 May 13.4 <NA> NA
## 6 2020 2020-6-01 Jun 12.9 Jun 257.983
## City Overall White areas Nonwhite areas col_1 % with
## 1 Philadelphia, PA 628 703 571 NA 0.4660742
## 2 Philadelphia, PA 628 703 571 NA 0.4660742
## 3 Philadelphia, PA 628 703 571 NA 0.4660742
## 4 Philadelphia, PA 628 703 571 NA 0.4660742
## 5 Philadelphia, PA 628 703 571 NA 0.4660742
## 6 Philadelphia, PA 628 703 571 NA 0.4660742
## Median amount among those with col_2 % with home foreclosure
## 1 1476 NA 0.004345603
## 2 1476 NA 0.004345603
## 3 1476 NA 0.004345603
## 4 1476 NA 0.004345603
## 5 1476 NA 0.004345603
## 6 1476 NA 0.004345603
## % housing-cost burdened, low-income % unbanked, metro area
## 1 0.6803305 0.041
## 2 0.6803305 0.041
## 3 0.6803305 0.041
## 4 0.6803305 0.041
## 5 0.6803305 0.041
## 6 0.6803305 0.041
## % with health insurance coverage % received EITC, low-income
## 1 0.8566335 0.4151677
## 2 0.8566335 0.4151677
## 3 0.8566335 0.4151677
## 4 0.8566335 0.4151677
## 5 0.8566335 0.4151677
## 6 0.8566335 0.4151677
## Unemployment rate Labor force participation rate
## 1 0.1093104 0.6044449
## 2 0.1093104 0.6044449
## 3 0.1093104 0.6044449
## 4 0.1093104 0.6044449
## 5 0.1093104 0.6044449
## 6 0.1093104 0.6044449
## % below 200% of federal poverty level % population change, 2000–15
## 1 0.4603734 0.03287668
## 2 0.4603734 0.03287668
## 3 0.4603734 0.03287668
## 4 0.4603734 0.03287668
## 5 0.4603734 0.03287668
## 6 0.4603734 0.03287668
## Gini index of income inequality l1_filings l1_percentage_diff
## 1 0.5049 NA NA
## 2 0.5049 2015 1.5565856
## 3 0.5049 1720 1.4411395
## 4 0.5049 813 0.7115974
## 5 0.5049 0 0.0000000
## 6 0.5049 0 0.0000000
This step constructs the modeling dataset used for regression analysis. Each row in the resulting dataset represents a single month, with aligned economic, financial, and temporal predictors. The dependent variable, percentage_diff, measures how eviction filings in a given month deviate from a historical average, making it a continuous outcome well suited for linear regression. Predictor variables include lagged eviction activity (l1_percentage_diff) to capture temporal persistence, macroeconomic indicators (unemployment_rate, cpi) to reflect short-term economic conditions, and Urban Institute financial health indicators that quantify underlying household vulnerability. Observations with missing outcome values are removed to ensure valid model estimation and consistent sample sizes across all predictors.
Ordinary Least Squares (OLS) regression is an appropriate and intentional modeling choice for this dataset given both the structure of the outcome and the nature of the predictors. The dependent variable is continuous and approximately symmetric around its mean, rather than binary or integer-count based, ruling out logistic and Poisson-style models. The predictors are a mix of continuous economic measures and proportion-based indicators, all of which naturally conform to the linear framework assumed by OLS. OLS also offers transparent coefficient interpretation, allowing each variable’s association with eviction deviations to be directly assessed, an important requirement for a policy-focused project. Additionally, OLS performs well with moderate sample sizes and does not impose unnecessary distribution assumptions beyond linearity and conditional independence. This makes OLS regression suitable for evaluating how economic disruptions and financial vulnerability jointly shape month-to-month eviction dynamics.
# Prepare modeling dataset
model_df <- eviction_trends_final %>%
select(
year, month_num,
month_filings,
percentage_diff,
l1_percentage_diff,
unemployment_rate,
cpi,
Overall,
`White areas`,
`Nonwhite areas`,
`% with`,
`Median amount among those with`,
`% housing-cost burdened, low-income`,
`% with home foreclosure`,
`% unbanked, metro area`,
`% received EITC, low-income`,
`Labor force participation rate`,
`% below 200% of federal poverty level`,
`Gini index of income inequality`
) %>%
filter(!is.na(percentage_diff))
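Because cpi appears only in alternating months in the imported file, lm() will silently drop the incomplete rows via its default na.action = na.omit. It is worth checking how many complete rows actually enter the model; a small guard, here on a toy frame mirroring the alternating-NA pattern:

```r
# Toy frame: cpi observed every other month, as in the merged data
toy <- data.frame(percentage_diff = c(1.6, 1.4, 0.7),
                  cpi = c(NA, 259.7, NA))
sum(complete.cases(toy))  # 1 row would survive na.omit
```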
Here the model is fit with the predictors.
The regression results revealed an important characteristic of the Urban Institute (UI) financial health variables: although substantively meaningful, they do not vary within the temporal unit used for modeling. Because they remain constant across all rows of the monthly dataset, these variables are perfectly collinear with the model intercept and are therefore dropped during estimation due to singularities. This behavior is expected under Ordinary Least Squares. More time would be needed to fully correct this issue or to source time-varying data with the same measures. Despite not receiving individual coefficient estimates, the UI variables still play an important conceptual role in the analysis. They provide context for the economic environment in which evictions occur, anchoring short-term fluctuations in broader measures of financial vulnerability such as housing cost burden, foreclosure prevalence, and income inequality. Rather than being interpreted as predictors, these variables help frame the limits of month-to-month responsiveness by describing the underlying baseline risk faced by households in Philadelphia.
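The singularity behavior described above is easy to reproduce. A self-contained sketch with a toy data frame (toy, x1, and x2 are illustrative names, with x2 standing in for a constant UI-style indicator):

```r
# Sketch: a predictor with no within-sample variation is collinear with
# the intercept, so lm() drops it and reports an NA coefficient
toy <- data.frame(
  y  = c(1, 2, 3, 4),
  x1 = c(0.1, 0.4, 0.2, 0.9),  # varies month to month
  x2 = c(5, 5, 5, 5)           # constant, like a yearly UI indicator
)

# Flag constant columns before fitting
constant_vars <- names(toy)[sapply(toy, function(v) length(unique(v)) == 1)]
constant_vars                      # "x2"

# lm() estimates x1 but returns NA for the constant column
coef(lm(y ~ x1 + x2, data = toy))
```

Screening for constant columns before fitting makes the singularity message unsurprising rather than alarming.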
eviction_model <- lm(
  percentage_diff ~
    l1_percentage_diff +
    unemployment_rate +
    cpi +
    `Median amount among those with` +
    `% housing-cost burdened, low-income` +
    `% with home foreclosure` +
    `% unbanked, metro area` +
    `% received EITC, low-income` +
    `Labor force participation rate` +
    `% below 200% of federal poverty level` +
    `Gini index of income inequality`,
  data = model_df
)
summary(eviction_model)
##
## Call:
## lm(formula = percentage_diff ~ l1_percentage_diff + unemployment_rate +
## cpi + `Median amount among those with` + `% housing-cost burdened, low-income` +
## `% with home foreclosure` + `% unbanked, metro area` + `% received EITC, low-income` +
## `Labor force participation rate` + `% below 200% of federal poverty level` +
## `Gini index of income inequality`, data = model_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39521 -0.06102 -0.00346 0.09482 0.57471
##
## Coefficients: (8 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.5103766 0.6122573 0.834 0.411096
## l1_percentage_diff 0.5408035 0.1254763 4.310 0.000162
## unemployment_rate -0.0663023 0.0192283 -3.448 0.001694
## cpi 0.0006943 0.0019027 0.365 0.717740
## `Median amount among those with` NA NA NA NA
## `% housing-cost burdened, low-income` NA NA NA NA
## `% with home foreclosure` NA NA NA NA
## `% unbanked, metro area` NA NA NA NA
## `% received EITC, low-income` NA NA NA NA
## `Labor force participation rate` NA NA NA NA
## `% below 200% of federal poverty level` NA NA NA NA
## `Gini index of income inequality` NA NA NA NA
##
## (Intercept)
## l1_percentage_diff ***
## unemployment_rate **
## cpi
## `Median amount among those with`
## `% housing-cost burdened, low-income`
## `% with home foreclosure`
## `% unbanked, metro area`
## `% received EITC, low-income`
## `Labor force participation rate`
## `% below 200% of federal poverty level`
## `Gini index of income inequality`
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1821 on 30 degrees of freedom
## (36 observations deleted due to missingness)
## Multiple R-squared: 0.7975, Adjusted R-squared: 0.7773
## F-statistic: 39.39 on 3 and 30 DF, p-value: 1.59e-10
In-sample prediction error metrics were computed first. Because the model includes lagged outcomes and economic predictors that are not observed for all periods, the regression was estimated on complete cases only; validation was therefore first conducted on the same set of observations used in estimation to ensure consistency between predicted and observed outcomes. Fitted values and residuals were extracted directly from the regression object to avoid misalignment caused by missing values.
Two standard error metrics were used to summarize model accuracy: Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). MAE provides a measure of average deviation between predicted and observed eviction pressure, while RMSE penalizes larger errors more heavily and highlights sensitivity to extreme under- or over-prediction. Together, these metrics provide a concise assessment of model fit and establish a baseline level of predictive performance prior to out-of-sample validation.
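The contrast between the two metrics can be seen on a toy pair of residual vectors with equal average error but different spread (illustrative values only):

```r
# Two residual vectors with identical MAE but different RMSE
even  <- c(0.1, 0.1, 0.1, 0.1)  # uniform small errors
spiky <- c(0.0, 0.0, 0.0, 0.4)  # same total error, concentrated in one month

mae_even  <- mean(abs(even))      # 0.1
mae_spiky <- mean(abs(spiky))     # 0.1 (MAE cannot tell them apart)

rmse_even  <- sqrt(mean(even^2))    # 0.1
rmse_spiky <- sqrt(mean(spiky^2))   # 0.2 (squaring penalizes the outlier)
```

This is why reporting both is informative: a gap between RMSE and MAE signals that errors are concentrated in a few badly predicted periods.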
# Model Validation: RMSE and MAE
# Extract the data actually used by the model because of UI data issues
model_used <- eviction_model$model
# Extract fitted values and residuals
predicted <- eviction_model$fitted.values
residuals <- eviction_model$residuals
# Compute MAE
mae <- mean(abs(residuals))
# Compute RMSE
rmse <- sqrt(mean(residuals^2))
# Output results
cat("Model Performance Metrics:\n")
## Model Performance Metrics:
cat("MAE =", round(mae, 4), "\n")
## MAE = 0.1225
cat("RMSE =", round(rmse, 4), "\n")
## RMSE = 0.1711
This step implements a time-based train–test split to evaluate the model’s out-of-sample predictive performance. Observations are first ordered chronologically using numeric year and month variables. The dataset is then divided such that all observations prior to 2023 are used for model training, while observations from 2023 onward are reserved for testing. This corresponds to using approximately 75–80% of the available time series for training and the remaining 20–25% for out-of-sample evaluation, depending on the exact temporal coverage of the dataset. Unlike random splits, this approach mirrors real-world forecasting conditions by ensuring the model is evaluated only on future observations it has not seen during estimation.
This time-based split is intentionally chosen because the outcome of interest (monthly changes in eviction activity) is inherently temporal. Randomly assigning observations to training and testing would risk temporal leakage, where the model indirectly learns from future data points. By training on earlier years and testing on later ones, this validation strategy provides a more realistic assessment of how well the model generalizes to new economic conditions. After splitting, complete-case filtering is applied consistently to both training and test sets to ensure the same variables are available in each, preventing estimation or prediction errors caused by missing values.
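The mechanics can be illustrated on a self-contained example with a simulated autocorrelated series (all names here are illustrative, not part of the analysis):

```r
# Sketch: time-based split on a simulated AR(1) series
set.seed(42)
n <- 120
y <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))

# Build a lagged-outcome design, dropping the first row (no lag available)
df <- data.frame(t = 1:n, y = y, y_lag = c(NA, y[-n]))[-1, ]

# Train on the first 80% of the timeline, test on the final 20%,
# so estimation never sees observations from the evaluation period
train <- df[df$t <= 96, ]
test  <- df[df$t >  96, ]

fit <- lm(y ~ y_lag, data = train)
rmse_oos <- sqrt(mean((test$y - predict(fit, newdata = test))^2))
```

A random split of `df` would scatter training rows among the test months, letting the lag structure leak future information into estimation; the chronological cut avoids that.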
# Train–Test Split
# Make numeric time variables
model_df_tt <- model_df %>%
  mutate(
    year_num  = as.integer(year),
    month_num = if ("month_num" %in% names(.)) month_num else as.integer(substr(month, 1, 2))
  ) %>%
  arrange(year_num, month_num)
# Train–test split by time
train_df <- model_df_tt %>% filter(year_num < 2023)
test_df <- model_df_tt %>% filter(year_num >= 2023)
# Variables used in the model
model_vars <- c(
  "percentage_diff",
  "l1_percentage_diff",
  "unemployment_rate",
  "cpi",
  "Median amount among those with",
  "% housing-cost burdened, low-income",
  "% with home foreclosure",
  "% unbanked, metro area",
  "% received EITC, low-income",
  "Labor force participation rate",
  "% below 200% of federal poverty level",
  "Gini index of income inequality"
)
# Keep only complete cases
train_df_clean <- train_df %>%
  filter(if_all(all_of(model_vars), ~ !is.na(.)))
test_df_clean <- test_df %>%
  filter(if_all(all_of(model_vars), ~ !is.na(.)))
We now refit the model on the training data only, using the same specification as before.
# Fit Model on Training Data
eviction_model_train <- lm(
  percentage_diff ~
    l1_percentage_diff +
    unemployment_rate +
    cpi +
    `Median amount among those with` +
    `% housing-cost burdened, low-income` +
    `% with home foreclosure` +
    `% unbanked, metro area` +
    `% received EITC, low-income` +
    `Labor force participation rate` +
    `% below 200% of federal poverty level` +
    `Gini index of income inequality`,
  data = train_df_clean
)
This step evaluates the predictive accuracy of the trained eviction model using the test dataset, which contains observations not used during model estimation. The model is applied to the test data to generate predicted values of monthly eviction changes, and these predictions are then compared to the observed outcomes. The differences between predicted and actual values (residuals) quantify how far the model’s forecasts deviate from reality when applied out of sample.
To summarize prediction error, we calculate Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). MAE captures the average size of prediction mistakes, while RMSE places greater weight on larger errors, highlighting periods where the model performs particularly poorly. Reporting both metrics provides a robust assessment of out-of-sample performance and demonstrates that the model is not only statistically meaningful in-sample but also useful for forecasting future eviction trends.
# Out-of-Sample Validation - Test Set
# Predict on test set
test_predictions <- predict(
  eviction_model_train,
  newdata = test_df_clean
)
# Residuals
test_residuals <- test_df_clean$percentage_diff - test_predictions
# MAE
mae_test <- mean(abs(test_residuals))
# RMSE
rmse_test <- sqrt(mean(test_residuals^2))
# Output
cat("Out-of-Sample Model Performance:\n")
## Out-of-Sample Model Performance:
cat("MAE =", round(mae_test, 4), "\n")
## MAE = 0.1532
cat("RMSE =", round(rmse_test, 4), "\n")
## RMSE = 0.1833
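For side-by-side reading, the reported metrics can be collected into a small comparison table (values copied from the output above):

```r
# In-sample vs out-of-sample error metrics reported above
metrics_tbl <- data.frame(
  sample = c("In-sample", "Out-of-sample"),
  MAE    = c(0.1225, 0.1532),
  RMSE   = c(0.1711, 0.1833)
)
metrics_tbl
```

The modest increase in both MAE and RMSE on the held-out 2023 observations suggests the model degrades gracefully out of sample rather than overfitting the training period.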
Here a new data frame is created to compare the residual distributions from the in-sample and out-of-sample error analyses.
# Prepare error data for comparison
# In-sample errors
in_sample_errors <- data.frame(
  residual = eviction_model$residuals,
  sample_type = "In-sample"
)
# Out-of-sample errors
out_sample_errors <- data.frame(
  residual = test_residuals,
  sample_type = "Out-of-sample"
)
# Combine
error_compare_df <- rbind(in_sample_errors, out_sample_errors)
# Plot
ggplot(error_compare_df, aes(x = residual, fill = sample_type)) +
  geom_histogram(
    alpha = 0.6,
    bins = 30,
    position = "identity"
  ) +
  facet_wrap(~ sample_type, scales = "free_y") +
  labs(
    title = "Comparison of Prediction Errors",
    subtitle = "In-sample vs Out-of-sample Residual Distributions",
    x = "Prediction Error (Residual)",
    y = "Count"
  ) +
  theme_minimal()